A Retinex-based network for image enhancement in low-light environments

Most existing low-light image enhancement methods suffer from detail loss, color distortion and excessive noise. To address these issues, this paper proposes a neural network-based low-light image enhancement network. The network is divided into three parts: a decomposition network, a reflection component denoising network, and an illumination component enhancement network. In the decomposition network, the input image is decomposed into a reflection image and an illumination image. In the reflection component denoising network, a Unet3+ network improved by fusing CA attention is adopted to denoise the reflection image. In the illumination component enhancement network, an adaptive mapping curve is adopted to enhance the illumination image iteratively. Finally, the processed illumination and reflection images are fused based on Retinex theory to obtain the final enhanced image. The experimental results show that the proposed network achieves excellent visual effects in subjective evaluation. Additionally, on several public datasets it shows a significant improvement over the compared methods in objective evaluation metrics, including PSNR, SSIM and NIQE.


Introduction
Images play an irreplaceable role in our daily life as a way to obtain information [1]. However, complicated shooting environments, varying lighting conditions and other factors lead to unsatisfactory image acquisition, with uneven illumination, low contrast and a large amount of noise. These defects interfere with image recognition in subsequent processing. Low-light image enhancement technology can be used to make images clearer and reduce identification costs. Research on image enhancement methods in low-light environments is therefore of great significance.
Existing enhancement methods can be classified into two main categories: traditional image enhancement methods and image enhancement methods based on deep learning. Traditional image enhancement methods are mainly used in industry. The main representative methods include grey scale transformation, histogram equalization, and the Retinex method [2][3][4][5][6][7]. Among them, enhancement methods based on Retinex theory are widely used, such as [5][6][7]. However, traditional methods do not handle noise well and often result in color distortion and unsatisfactory denoising.
Image enhancement methods based on deep learning have developed rapidly in recent years. LLNet was the first application of deep learning to low-light image enhancement [8]. Afterwards, Retinex-Net applied deep neural networks to Retinex theory for image enhancement [9]. However, these methods assumed that a "Ground Truth" image exists and therefore ignored the influence of noise on different regions, resulting in poor detail restoration. The methods in [10][11][12] avoided the need for "Ground Truth" reflectance and illumination images, but they overlooked detail optimization, such as structure and texture. To reduce reliance on paired datasets, some unsupervised methods were proposed. Zero-DCE became the first low-light enhancement network that operated independently of paired datasets [13]. EnlightenGAN further minimized reliance on paired data [14]. However, due to the lack of guidance from paired datasets, unsupervised methods were unable to effectively learn real-world scene features and had limited generalization capabilities. Other methods [15][16][17][18][19][20][21] attempted to address unrealistic recovery effects and complex network scales by introducing new learning modules and attention mechanisms. Nevertheless, these methods had limitations in their experimental outcomes and lacked robust generalization. For example, in different scenarios, the restoration effects of [13-16, 18, 21] were unstable, and the models lacked constraints and guidance for specific scenes. Under strong light conditions, over-enhancement can be observed in [17,19,20]. To address these limitations, we propose a new model based on Retinex. Compared to other advanced methods, the proposed method shows good restoration effects in complicated conditions such as extreme low light and overexposure. It also shows excellent performance in image denoising and enhancement without the need for a large amount of training data.
This paper is arranged as follows. The Proposed Network section introduces the structure of the proposed enhancement network. Experimental data sets and details of the training process are presented in the Experimental Process section. The Results and Analysis section analyses the experimental results of different methods. The Conclusions section draws the conclusion.

Proposed network
In this paper, a low-light image enhancement network combining Retinex theory [4] and a convolutional neural network is designed. The principle of Retinex is shown in Fig 1.
According to the Retinex theory, the illumination image represents the lighting conditions and the reflection image represents the texture information of the object. The observed image is the product of the illumination image and the reflection image, so enhancement is achieved by processing the two components and multiplying them back together. This relationship is expressed by Eq (1).
$S(x,y) = R(x,y) \times L(x,y)$ (1)
S(x,y) represents the image information received by the observer. L(x,y) represents the illumination component of light. R(x,y) represents the reflection component of the object.
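As a minimal sketch of Eq (1), the recombination step can be written as an element-wise product. The function name `retinex_compose`, the [0, 1] value range and the final clipping are assumptions made for illustration; they are not details taken from the paper.

```python
import numpy as np

def retinex_compose(reflectance, illumination):
    """Recombine an image as S = R * L, element-wise (Eq (1)).

    Assumes both inputs are scaled to [0, 1] (an illustrative choice).
    """
    reflectance = np.asarray(reflectance, dtype=np.float64)
    illumination = np.asarray(illumination, dtype=np.float64)
    # A single-channel illumination map (H, W) is broadcast over
    # the colour channels of an (H, W, 3) reflectance image.
    if illumination.ndim == reflectance.ndim - 1:
        illumination = illumination[..., np.newaxis]
    return np.clip(reflectance * illumination, 0.0, 1.0)

# Toy example: constant reflectance under four different light levels.
R = np.full((2, 2, 3), 0.8)
L = np.array([[0.1, 0.5], [1.0, 0.25]])
S = retinex_compose(R, L)
```

In this toy case the same surface (constant R) looks darker wherever L is small, which is the separation the decomposition network is meant to recover.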
Based on the Retinex theory, an image can be decomposed into a reflection component and an illumination component. A network is built for each component, and an additional network is needed to decompose the image. Therefore, the proposed network can be divided into three parts: the decomposition network, the reflection component denoising network, and the illumination component enhancement network. The overall network structure is shown in Fig 2. The specific network design and the corresponding loss function for each subnetwork are described in the following parts.

Decomposition network
The structure of KinD [10] is adopted in the decomposition network. However, KinD has problems of over-enhancement and visual defects. Therefore, in the first and third convolutional layers, the original ReLU activation function is replaced by GELU, which exhibits stable optimization and good generalization. Compared to ReLU, GELU can better capture complex relationships in image data, which helps enhance the structural and textural information of the image. The smoothness of GELU mitigates issues such as exploding or vanishing gradients, resulting in better preservation of image detail and better handling of exposure.
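To make the difference between the two activations concrete, the following sketch evaluates both on a few scalar inputs. It uses the exact GELU definition x·Φ(x), where Φ is the standard normal CDF; whether the network uses this exact form or the common tanh approximation is not stated in the paper, so that choice is an assumption.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# ReLU clips every negative input to zero, while GELU keeps a small,
# smoothly varying negative response and approaches x for large x.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Because GELU has no kink at zero and does not zero out small negative activations, its gradient changes smoothly, which is the property the paragraph above relies on.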
Furthermore, to improve the accuracy of the network, a structural similarity (SSIM) loss function [22] is added. SSIM is a metric used to measure image quality, mainly by assessing the structural similarity between two images. The SSIM loss function covers three aspects of image features: brightness, contrast and structure. By minimizing the SSIM loss function, the decomposed image can be made closer to the original image and can maintain a better perceptual quality. The details of the decomposition network are shown in Fig 3. Five loss functions are used in the decomposition network: the reconstruction loss, the reflection component consistency loss, the illumination component smoothing loss, the illumination intercorrelation loss, and the structural similarity loss. Their notation is as follows.
In the reconstruction loss $L^{ID}_{rec}$, $S_l$ and $S_h$ denote the low-light image and the normal-light image, respectively. $R_l$ and $R_h$ denote the reflection components from the decomposition of the low-light image and the normal-light image, respectively. $I_l$ and $I_h$ denote the illumination components from the decomposition of the low-light image and the normal-light image, respectively.
The reflection component consistency loss is denoted $L^{ID}_{rs}$. In the illumination component smoothing loss $L^{ID}_{is}$, $\nabla$ is the first-order derivative operator and $\epsilon$ is a constant, here set to 0.01. In the illumination intercorrelation loss $L^{ID}_{mc}$, $c$ is the parameter that controls the shape of the function, here set to 10, and $G$ represents the sum of the gradients.
In the structural similarity loss $L^{ID}_{SSIM}$, $S_{out}$ and $S_h$ denote the output image and the normal-light image, respectively. $\mu_{S_{out}}$ and $\mu_{S_h}$ denote the mean values of the output image and the normal-light image, respectively. $\sigma_{S_{out}}$ and $\sigma_{S_h}$ denote the standard deviations of the output image and the normal-light image, respectively. $c_1$ and $c_2$ are constants.
The total loss function of the image decomposition network is the weighted sum of the five losses, where $\lambda_{rec}$, $\lambda_{rs}$, $\lambda_{is}$, $\lambda_{mc}$, and $\lambda_{SSIM}$ are the weighting coefficients for the reconstruction loss, reflection component consistency loss, illumination component smoothing loss, illumination intercorrelation loss, and structural similarity loss, respectively. They are set to 1, 0.009, 0.2, 0.15 and 0.07, respectively.
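The sketch below illustrates how an SSIM-based loss and the weighted total could be computed. Several points are assumptions rather than the paper's definitions: SSIM is evaluated once over the whole image instead of over local windows, the covariance term follows the standard SSIM formula, the loss is taken as 1 − SSIM, and the defaults for `c1` and `c2` are the usual values for images in [0, 1]. All function and variable names are hypothetical.

```python
import numpy as np

# Weights reported for the decomposition network.
LAMBDAS = {"rec": 1.0, "rs": 0.009, "is": 0.2, "mc": 0.15, "ssim": 0.07}

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Standard SSIM formula using whole-image statistics (assumption)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def ssim_loss(output, target):
    """Assumed loss form: 1 - SSIM, so identical images give zero loss."""
    return 1.0 - ssim_global(output, target)

def decomposition_total_loss(terms, weights=LAMBDAS):
    """Weighted sum of the five loss terms, keyed like LAMBDAS."""
    return sum(weights[name] * terms[name] for name in weights)
```

With these weights the reconstruction term dominates, and the SSIM term acts as a comparatively small perceptual correction.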
We compare the effects before and after adding the SSIM loss function to KinD. The experimental results are shown in Fig 4. Although KinD shows a muddy shadow in some areas after the SSIM loss function is added, the overall effect is better and the outcome aligns more closely with the visual perception of the human eye.

Reflection component denoising network
When the low-light image passes through the decomposition network, the reflection image retains the detail information. However, the noise in the low-light region is amplified at the same time. Therefore, it is necessary to denoise the decomposed reflection image. The structure of Unet3+ [23] is adopted in the reflection component denoising network. However, Unet3+ does not account for the size of the objects being extracted, which results in a mismatch between the receptive field and the scale and limits its denoising ability. Therefore, CA attention [24] is added to the encoder part of Unet3+. CA attention combines channel attention and spatial attention to enhance the capture of direction and position information. It helps the network adaptively learn the noise model of different regions in the image and make weighted estimates of the noise, so that the network can recover parts of the signal more accurately and retain more detailed information. CA attention can help the network to achieve […]
In the multi-scale structural similarity loss $L_{ms-ssim}$, $M$ denotes the total number of scales, here set to 2. $\mu_p$ and $\mu_g$ denote the means of the denoised and normal-light reflectance images, respectively. $\sigma_p$ and $\sigma_g$ denote the standard deviations of the denoised and normal-light reflectance images, respectively. $C_1$ and $C_2$ are constants. $\sigma_{pg}$ denotes the covariance of the denoised and normal-light reflectance images. The $\beta_m$ and $\gamma_m$ components are both set to 0.2856.
In the detail loss $L_{par}$, $R_h$ denotes the reflectance image of the normal-light image and $R_L$ denotes the denoised reflectance image. $\|\cdot\|_1$ denotes the $L_1$-norm regularization constraint on both.
The total loss function of the reflection component denoising network is the weighted sum of the two losses, where $\lambda_{ms-ssim}$ and $\lambda_{par}$ are the weighting coefficients of the multi-scale structural similarity loss and the detail loss, respectively. $\lambda_{ms-ssim}$ is set to 1 and $\lambda_{par}$ is set to 0.009.
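The following is a small sketch of the detail loss and the weighted total for the denoising network. It assumes the detail loss is the mean absolute difference between the denoised and normal-light reflectance images (the paper only states that an L1 norm is used, so averaging rather than summing is an assumption), and it treats the multi-scale SSIM loss as a precomputed scalar. Function names are hypothetical.

```python
import numpy as np

# Weights reported for the reflection component denoising network.
LAMBDA_MS_SSIM = 1.0
LAMBDA_PAR = 0.009

def detail_loss(r_denoised, r_normal):
    """L1 detail loss, taken here as the mean absolute difference."""
    r_denoised = np.asarray(r_denoised, dtype=np.float64)
    r_normal = np.asarray(r_normal, dtype=np.float64)
    return float(np.mean(np.abs(r_denoised - r_normal)))

def denoising_total_loss(ms_ssim_loss, par_loss):
    """Weighted sum of the multi-scale SSIM loss and the detail loss."""
    return LAMBDA_MS_SSIM * ms_ssim_loss + LAMBDA_PAR * par_loss
```

The small weight on the detail term means the structural term drives training, while the L1 term mainly discourages pixel-level drift.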
We compare the effects before and after adding the CA attention to Unet3+. The experimental results are shown in Fig 7. The Unet3+ recovered image is brighter, but it suffers from increased blurriness and noise. Although the addition of CA attention reduces image […]
In the exposure control loss $L_{exp}$, $E$ is a constant set to 0.6, $M$ is the total number of pixels, and $Y_k$ is the mean value of a pixel region. The color constancy loss $L_{col}$ is
$L_{col} = \sum_{\forall (p,q) \in \varepsilon} (J^p - J^q)^2, \quad \varepsilon = \{(R,G), (R,B), (G,B)\}$ (13)
$J^p$ and $J^q$ are the luminance averages of color channel $p$ and color channel $q$, respectively. $(p,q)$ traverses all pairwise combinations of the three color channels.
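Eq (13) can be sketched directly: compute the mean of each color channel and sum the squared differences over the three channel pairs. The function name and the (H, W, 3) RGB layout are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def color_constancy_loss(image):
    """Eq (13): sum of squared differences between channel means.

    `image` is assumed to be an (H, W, 3) array in R, G, B order.
    """
    image = np.asarray(image, dtype=np.float64)
    channel_means = image.reshape(-1, 3).mean(axis=0)  # J^R, J^G, J^B
    return float(sum(
        (channel_means[p] - channel_means[q]) ** 2
        for p, q in combinations(range(3), 2)  # (R,G), (R,B), (G,B)
    ))
```

A neutral gray image gives zero loss, while any color cast raises it, which is why the term penalizes color deviation in the enhanced result.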
In the illumination smoothing loss $L_{tv_A}$, $N$ denotes the number of iterations, and $\nabla_x$ and $\nabla_y$ denote the gradient operators in the horizontal and vertical directions, respectively.
In the spatial consistency loss $L_{spa}$, $Y$ denotes the pixel value after enhancement, $I$ denotes the pixel value before enhancement, and $\Omega$ is the set of neighboring pixels of a pixel.
The total loss function of the illumination component enhancement network is the weighted sum of the four losses, where $W_{exp}$, $W_{col}$, $W_{tv_A}$, and $W_{spa}$ are the weighting factors for the exposure control loss, color constancy loss, illumination smoothing loss, and spatial consistency loss, respectively. They are set to 10, 5, 200 and 1, respectively.
The results for different numbers of iterations in the illumination component enhancement network are shown in Table 1 and Fig 9. Table 1 shows that the best effect is reached when n is 6.
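To illustrate what iterating a mapping curve n times means, the sketch below applies a curve six times to a few illumination values. The curve formula is not reproduced in this section, so the quadratic curve of Zero-DCE [13], x + α·x·(1 − x), is used purely as a stand-in; the constant α values are also hypothetical, whereas a network would normally predict them per pixel.

```python
import numpy as np

def enhance_illumination(illumination, alphas):
    """Apply a quadratic mapping curve once per entry of `alphas`.

    Stand-in curve (from Zero-DCE): x <- x + a * x * (1 - x).
    For x in [0, 1] and a in [0, 1] the result stays in [0, 1].
    """
    x = np.asarray(illumination, dtype=np.float64)
    for a in alphas:
        x = x + a * x * (1.0 - x)
    return x

# n = 6 iterations with a hypothetical constant curve parameter.
dark_values = [0.0, 0.2, 1.0]
enhanced = enhance_illumination(dark_values, alphas=[0.5] * 6)
```

Each pass brightens mid-range values while leaving 0 and 1 fixed, so more iterations give stronger enhancement and the iteration count trades brightness against over-enhancement.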

Experimental data sets
In the training process, 485 image pairs of the LOL dataset are used as the training set. The remaining 15 pairs of the LOL dataset are used as the test set. To verify the effect of the model, the paired datasets VE-LOL-L, SID and ELD are used as additional test sets. The unpaired datasets DICM and MEF are also used as test sets.

Details of training process
The experiments are carried out under the PyTorch 1.10.1 framework, based on Python 3.7 with a CUDA 11.1 environment.

Results and analysis
Both subjective and objective evaluations are employed to assess the effects of image enhancement. To validate the necessity of each sub-network, we conducted ablation experiments. In the objective evaluation, various representative metrics are used. These include peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and no-reference metrics such as the Natural Image Quality Evaluator (NIQE), Perceptual Index (PI), Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), and Neural Image Assessment (NIMA).

Subjective visual evaluation
[…] chairs and the water bottle in the image are lost, and the overall visual effect is blurry. URetinex-Net directly performs Retinex decomposition and reconstruction on the low-light image. Therefore, it cannot completely recover extremely low-light images, and problems such as distortion remain in the recovered image details. These problems can affect image quality and visual effect.
Figs 10-15 show that the network proposed in this paper performs better than the other five methods, and its results are more in line with the visual perception of the human eye. This supports the effectiveness and generalization of the proposed network.

Objective metric evaluation
In the objective evaluation, several objective metrics are used to assess the performance comprehensively. These metrics include PSNR, SSIM, NIQE, PI, BRISQUE, and NIMA. Higher values of PSNR, SSIM, and NIMA indicate better image quality, while lower values of NIQE, PI, and BRISQUE indicate better visual quality. Results in bold font in the tables represent the best outcomes.
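As a reminder of how the full-reference PSNR metric is obtained, the sketch below computes it from the mean squared error between a reference image and a test image. The function name and the assumption that images are scaled to [0, 1] (so `max_value=1.0`) are illustrative choices; the evaluation code used in the paper is not specified.

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is logarithmic in the error, a gain of a few dB corresponds to a substantial reduction in mean squared error, which is why higher values indicate better quality.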
To make the comparison as reliable as possible, the test set consists of 15 images from the LOL dataset and 10 images each from the VE-LOL-L, SID, ELD, DICM and MEF datasets. The average values of the six methods are calculated on each dataset. The experimental results are shown in Tables 2-7.
The experimental results on the paired datasets LOL and VE-LOL-L are shown in Tables 2 and 3. As shown in Table 2, the proposed network achieves the best values of 22.4568, 0.8243, and 4.7531 in PSNR, SSIM, and NIMA, respectively. These values are 13.18%, 8.52%, and 3.5% higher than the second-highest values. According to Table 3, the proposed network achieves the best values of 21.3067, 0.8943, and 29.3789 in PSNR, SSIM, and BRISQUE, respectively. These values are 6.75%, 0.85% and 8.9% better than the second-best values.
The experimental results on the paired datasets SID and ELD are presented in Tables 4 and 5. Table 4 shows that in SSIM, PI, BRISQUE, and NIMA, our method improves by 17.04%, 6.48%, 3.59%, and 8.47% over the second-best values. In Table 5, our method improves by 10.48%, 11.88%, 2.97%, and 10.02% in PSNR, SSIM, NIQE, and BRISQUE over the second-best values.
The experimental results on the unpaired datasets DICM and MEF are shown in Tables 6 and 7. Table 6 shows that the proposed network achieves the best values of 2.1752 and 14.2458 in PI and BRISQUE, respectively. These values are 17.74% and 4.68% better than the second-best values. As shown in Table 7, the proposed network achieves the best value of 2.5957 in PI. Although the values of the metrics other than PI are not optimal, the result achieves a more balanced performance.
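The percentage comparisons above can be read as relative differences with respect to the second-best value, with the sign flipped for metrics where lower is better. The helper below is one plausible way to compute such figures; the paper does not state its exact formula, so this is an assumption, and the numbers in the example are hypothetical rather than taken from the tables.

```python
def relative_improvement(best, second_best, higher_is_better=True):
    """Percentage by which `best` improves on `second_best`.

    For higher-is-better metrics (PSNR, SSIM, NIMA) the gain is
    (best - second) / second; for lower-is-better metrics
    (NIQE, PI, BRISQUE) it is (second - best) / second.
    """
    if higher_is_better:
        return (best - second_best) / second_best * 100.0
    return (second_best - best) / second_best * 100.0

# Hypothetical values, for illustration only.
psnr_gain = relative_improvement(22.0, 20.0)  # higher is better
brisque_gain = relative_improvement(27.0, 30.0, higher_is_better=False)
```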

Ablation study
To verify the necessity of each part of the proposed framework, we conducted ablation experiments by separately removing the denoising network and the enhancement network. The results of the experiments are shown in Fig 16 and Table 8.
From Fig 16(B), it can be observed that when the enhancement sub-network is present but the denoising sub-network is absent, the overall image exhibits excessive noise and unclear […]
The objective comparisons of the ablation results for each module are presented in Table 8. It can be seen that the absence of either the denoising network or the enhancement network leads to relatively poor performance across multiple metrics. In contrast, the complete network achieves the best results across all metrics, further demonstrating the effectiveness of the method proposed in this paper.

Conclusions
To further improve the effect of low-light image enhancement, a Retinex-based image enhancement network for low-light environments is proposed. A new loss function, a CA attention mechanism and an adaptive dynamic iteration method are introduced in the proposed network. Experiments show that most objective metrics are improved. At the same time, the proposed network has a better denoising effect, and its visual effect is more in line with human vision. These results support the effectiveness and generalization of the network proposed in this paper.

Fig 5 shows the effects of replacing the ReLU function with the GELU function in the decomposition network. Fig 5(B) shows that the reflection image is blurry and has serious color distortion when using ReLU. In contrast, in Fig 5(C), using GELU results in more realistic colors and less noise. Fig 5(D) shows overexposure in the illumination image when using ReLU, whereas Fig 5(E) shows a clearer image when using GELU.

Fig 9 also verifies that when n is 6 the image is more in line with human subjective visual perception.

In this environment (PyTorch 1.10.1, Python 3.7, CUDA 11.1), the Adam optimizer is used in the training process. Low-light image enhancement is accomplished by the proposed three sub-networks. The experimental details and steps are as follows.
1. Input the normal-light image and the low-light image into the decomposition network for decomposition. The learning rate of the decomposition network is set to 0.004 and the batch size is set to 32.
2. Input the decomposed reflection image into the reflection component denoising network for denoising. The learning rate of the reflection component denoising network is set to 0.001 and the batch size is set to 1.
3. Input the decomposed illumination image into the illumination component enhancement network for enhancement. The learning rate of the illumination component enhancement network is set to 0.001, the number of iterations n is set to 6, and the batch size is set to 8.
4. Multiply the denoised reflection image R(x,y) and the enhanced illumination image L(x,y) to obtain the final image.
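The reported training settings can be gathered into a small configuration structure, sketched below. The hyperparameter values come from the steps above; the class, field and stage names are hypothetical, and no training loop is implied.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Training settings for one sub-network (names are illustrative)."""
    name: str
    learning_rate: float
    batch_size: int

OPTIMIZER = "Adam"      # optimizer reported for all stages
CURVE_ITERATIONS = 6    # n, used by the illumination enhancement stage

STAGES = (
    StageConfig("decomposition", learning_rate=0.004, batch_size=32),
    StageConfig("reflection_denoising", learning_rate=0.001, batch_size=1),
    StageConfig("illumination_enhancement", learning_rate=0.001, batch_size=8),
)

for stage in STAGES:
    print(f"{stage.name}: lr={stage.learning_rate}, batch={stage.batch_size}")
```

Keeping the three stages in one immutable structure makes it easy to see that each sub-network is trained with its own learning rate and batch size.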
In the subjective visual evaluation, KinD, Retinex-Net, SCI, URetinex-Net, and Zero-DCE are selected for comparison. The final results on the test set are shown in Figs 10-15.

Table 3. Comparison results of different methods on VE-LOL-L dataset.
[…] dull color appearance. In Fig 16(D), the contrast of the image is effectively improved, with clearer details and noticeable noise reduction. This clearly shows that all the components are important for achieving better performance.